Avoid repeated unpad/pad calls when `use_cache=False` #5
base: add-flash-attn-2
Conversation
Benchmark script for completeness: https://pastebin.com/zWE9Aedr
The changes look overall great to me! I wonder if we can add `padding_mask` inside `flash_kwargs`. To me this makes the attention forward signature a bit more complicated, but for the speedup we get I think it is worth it. Can you also confirm that `generate` with `use_cache` works fine here?
I would also like a review from @ArthurZucker before merging this.
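For illustration, a minimal sketch of what passing `padding_mask` through a `flash_kwargs` dict could look like; the function and argument names below are assumptions made for the example, not the actual transformers signatures:

```python
# Hedged sketch: bundle padding_mask into a flash_kwargs dict instead of
# widening the attention forward signature. Names (flash_kwargs,
# decoder_layer_forward, attention_forward) are illustrative assumptions.
def attention_forward(hidden_states, attention_mask=None, **flash_kwargs):
    padding_mask = flash_kwargs.get("padding_mask", None)
    # ... compute q/k/v here and dispatch to the flash-attention path,
    # using padding_mask only when it is provided ...
    return hidden_states

def decoder_layer_forward(hidden_states, attention_mask=None, padding_mask=None):
    flash_kwargs = {}
    if padding_mask is not None:
        # Only the flash-attention path needs the padding mask
        # to unpad variable-length sequences.
        flash_kwargs["padding_mask"] = padding_mask
    return attention_forward(hidden_states, attention_mask=attention_mask, **flash_kwargs)
```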
This is a draft by the way, I just wanted to get results. I'm not sure it is a very good fit for transformers though, with the modifications directly in the …
It's nice that you found a way to do this, but as you said it's not very `transformers`-like and a bit bloated, especially if it is unusable with `use_cache=True` 😢
* Cohere Model Release (#1)
* Remove unnecessary files and code (#2): some cleanup
* Delete cohere-model directory (#3)
* Make Fix (#5)
* Pr fixes (#6): fixes for pr; pr fixes for the format; src/transformers/models/auto/tokenization_auto.py
* Tokenizer test (huggingface#8): tokenizer test; format fix
* Adding Docs and other minor changes (huggingface#7)
* Add modeling tests (huggingface#9)
* Smol Fix (huggingface#11): tokenization tests are fixed; format fixes; fix pr doc tests; fix pr style check; small changes in cohere.md
* FIX: Address final comments for transformers integration (huggingface#13): fix modeling final nits and add proper test file; for now leave empty tests; add integration test; push new test
* fix modeling cohere (huggingface#14)
* Update chat templates to use the new API (huggingface#15)

Co-authored-by: ahmetustun <[email protected]>
Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: Matt <[email protected]>
As per title. The difference is quite large. This is only done out of curiosity, cc @younesbelkada
Note: Speedup over the base PR is expected only in the case of `batch_size > 1`, when padding / masked tokens are present. In the benchmark below, we use a padding percentage of 30%.
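For context, the idea behind the title is to unpad the hidden states once before the decoder stack and pad them back once after it, instead of round-tripping through unpad/pad inside every attention layer. A minimal sketch, assuming flash-attn's `bert_padding` helpers (whose exact return signature varies across flash-attn versions) and a hypothetical per-layer varlen attention call:

```python
# Hedged sketch of the "unpad once" idea; the per-layer call signature is
# a placeholder, and unpad_input's return values differ across flash-attn
# versions, so treat this as illustrative rather than the PR's exact code.
from flash_attn.bert_padding import unpad_input, pad_input

def model_forward(hidden_states, attention_mask, layers):
    batch, seqlen, hidden = hidden_states.shape

    # Unpad once: keep only the non-masked tokens as a flat (total_tokens, hidden) tensor.
    hidden_states, indices, cu_seqlens, max_seqlen = unpad_input(hidden_states, attention_mask)[:4]

    for layer in layers:
        # Each layer runs a varlen flash-attention kernel directly on the
        # unpadded tokens, so no per-layer unpad/pad round-trips are needed.
        hidden_states = layer(hidden_states, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen)

    # Pad back once at the end to restore the (batch, seqlen, hidden) layout.
    return pad_input(hidden_states, indices, batch, seqlen)
```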
This is on a single A100 for `meta-llama/Llama-2-7b-hf`.

Forward only with no_grad mode:
- batch_size=4, len=1000
- batch_size=4, len=2000
- batch_size=8, len=500
- batch_size=2, len=4000

Forward + backward:
- batch_size=4, len=1500
- batch_size=2, len=3000
- batch_size=2, len=1000
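As a rough idea of the measurement setup (the full script is in the pastebin link above), a hedged sketch of a forward-only timing loop with ~30% padded tokens; the loading flags and timing details here are assumptions, not the exact script used in this PR:

```python
# Rough benchmark sketch, assuming random token ids and ~30% padding;
# the real script used for the numbers in this PR is the pastebin link above.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # assumption: a flash-attn build is available
).to("cuda")

def run_forward(batch_size, seq_len, pad_fraction=0.3, n_iters=10):
    input_ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len), device="cuda")
    attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long, device="cuda")
    # Mask out ~30% of the tokens to emulate padding in the batch.
    attention_mask[:, : int(seq_len * pad_fraction)] = 0

    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    with torch.no_grad():
        for _ in range(n_iters):
            model(input_ids=input_ids, attention_mask=attention_mask, use_cache=False)
    end.record()
    torch.cuda.synchronize()
    print(f"bs={batch_size} len={seq_len}: {start.elapsed_time(end) / n_iters:.1f} ms / forward")

for bs, seq_len in [(4, 1000), (4, 2000), (8, 500), (2, 4000)]:
    run_forward(bs, seq_len)
```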